Introduction to Population Genetics (Lab 2)

Quanitifying Neutral and Adaptive Genetic Variation for Landscape Genetic Studies

Overview of landscape influences on genetic variation

Deoxyribonucleic acid, or DNA, is the genetic material that codes for all of the diversity of life. The information in DNA is stored in four chemical bases: adenine (A), guanine (G), cytosine (C), and thymine (T). DNA bases pair up with each other, A with T and C with G, to form units called base pairs. Each base is attached to a sugar molecule and a phosphate molecule to make up the double helix backbone of DNA. The order, or sequence, of bases provides the information for building and maintaining individual organisms. DNA contains genes that code for proteins, which are created through the processes of transcription and translation. During transcription, DNA is translated to messenger RNA (mRNA) and during translation the sequence of mRNA is read. Each sequence of three bases codes for an amino acid and mRNA is converted to a protein one amino acid at a time. Differences in genetic sequencing is known as genetic variation.

Genetic variation can be examinined at locations in the genome under selection (adaptive loci), or locations that are not under selection (neutral loci).

The four main proccess that influence genetic variation are:

1) Mutations Mutation is the process that creates genetics variants or alleles due to errors in DNA replication.

2) Selection Selection impacts genetic variation by increasing the frequencies of alleles that are favorable in a particular environment and decreasing the frequencies of alleles that are less favorable in that environment. If a locus or a mutation is neutral, selection does not directly influence the frequency of alleles.

3) Gene Flow Gene flow occurs as a result of migration (dispersal and subsequent reproduction) and this process moves alleles between populations and tends to make them more similar genetically.

4) Genetic Drift Genetic drift is the change in allele frequencies due to randomsampling effects as alleles are passed on from one generation to the next. Genetic drift is thus influenced by the number of breeding individuals and the variance in reproductive success among breeders.

Genetic drift is directly linked to effective population size (Ne). Ne was developed to mathematically and conceptually represent the effects of genetic drift, and Ne is the population genetic analog to census size used in ecology. Ne is described as the number of breeding individuals in an idealized population that contribute genetically to the next generation. A simplified but useful approach for understanding Ne is to think about it as the number of individuals that are able to successfully pass on their genes to the next generation. Populations with larger Ne will be able to maintain higher amounts of genetic variation because more new alleles are created by mutation due to a larger number of breeding events and fewer alleles are lost to genetic drift since there is a larger sample of breeders each generation. Ne also influences the relative impacts of genetic drift and selection in natural populations . For example, as Ne decreases, the effect of selection decreases relative to the effect of genetic drift and important adaptive genetic variation can be lost.

The Two Main Components of Genetic Varation:

1) Genetic Diversity

Genetic diversity is the amount of genetic variation found within an individual, a population, or a spatial area.

2) Genetic Structure

Genetic structure is the distribution of genetic variation among individuals, populations, or areas.

Landscapes influence the amount and distribution of genetic variation in two main ways:

1) The landscape influences where an organism can live and how many individuals live in a particular location that affects the distribution and amount of genetic variation

2) The landscape influences the amount of movement and subsequent gene flow that occurs among individuals in different locations, which directly affects Ne and the amount and distribution of genetic variation

Overview of DNA types and molecular methods

Two Main Types of DNA in animal cells:

1) Mitochondrial DNA (mtDNA)

MtDNA is a circular DNA molecule of the mitochondrion, an organelle that is the energy powerhouse of cells. MtDNA has a uniparental mode of inheritance in animals and plants as it is generally passed only from mother to offspring, but paternal transmission has been documented in some animals and conifer species.

2) Nuclear DNA (nDNA)

Nuclear DNA is the DNA of chromosomes found in the nucleus of a cell and has a biparental mode of inheritance since genetic material is inherited fromboth the mother and the father.

On average, the Ne of nDNA loci is four times higher than the Ne of cpDNA and mtDNA because organellar DNA is most often uniparentally inherited and haploid. The type of DNA, its mutation rate, and mode of inheritance influence the research questions that can be addressed and the temporal inference that can be obtained from genetic data. In general, nDNA loci are used more frequently in landscape genetic studies (90%) than mtDNA or cpDNA loci (Storfer et al. 2010) because they provide better resolution for detecting recent changes to the landscape.

Adaptive versus Neautral Loci

Molecular Methods

The two main types of nDNA molecular methods:

1)Codominant

In codominant methods like microsatellite analysis or single-nucleotide polymorphism (SNP) analysis, it is possible to visualize both alleles at a particular locus in the form of a genotype (AA versus AB, for example).

2)Dominant

While dominant loci methods, such as amplified fragment length polymorphism (AFLP), create banding patterns of tens to hundreds of bands for each individual that resemble a barcode, both alleles from a particular locus cannot be identified. Dominant loci data are generally recorded in a binary presence/ absence format, where 1 indicates presence and 0 indicates absence (0, 1, 0, 0, 0, 1, for example).

Mitochondrial DNA and cpDNA are generally analyzed using DNA sequencing or restriction fragment length polymorphism (RFLP), approaches that generate haplotype data rather than genotype data since the loci are haploid.

Unit of analysis

Two main types of analysis for genetic data in landscape genetics:

1) Individuals

Individual is chosen as unit of analysis when the species is more continuously distributed across a landscape and there is no population boundaries.

2) Populations

Population is chosen as the unit of analysis for species that are patchily distributed

Important population genetic models

Hardy–Weinberg equilibrium

This rule states that for a diploid organism with two alleles at one locus, if p represents the frequency of one allele and q represents the frequency of the other allele, then genotype frequencies will be p2 + 2pq +q2 =1, whereby p2 represents the pp genotype or the p homozygote frequency, 2pq represents the heterozygote frequency, and q2 represents the q homozygote frequency.

These allele frequencies are expected under Hardy–Weinberg equilibrium (HWE) in one generation of random mating in a population, that:

1) is infinite in size

2) has no selection

3) has no mutation

4) has no migration

5) has random mating

Linkage equilibrium

Under Mendel’s law of independent assortment, it is assumed that the cross-generational transmission of alleles at any particular locus is independent of alleles at other loci and that the fitnesses of possible pairs of alleles are decoupled. It follows, then, that linkage disequilibrium exists when there is a statistical (non-random) association between alleles at different loci. These alleles can be physically linked on chromosomes, whereby they are in near proximity to one another and those inherited together. Alternatively, other evolutionary processes such as genetic drift in small populations can create linkage disequilibrium among pairs of alleles. Moreover, selection can cause genetic “hitchhiking”,where by alleles of little-to-no effect on fitness are inherited together with alleles favored under selection.

Effective population size and genetic drift

Populations are not infinite in size and Ne is almost always smaller than the census population size or the number of individuals estimated from field or other surveys. Unequal sex ratio, assortative mating, and overlapping generations are all examples ofdemographic factors that make effective population sizes smaller than census population sizes.

There are two commonly estimated measures of effective population size:

1) Variance effective size (NeV) is the size of an ideal population experiencing drift at the same rate as the actual population.

2) Inbreeding effective size (NeI) is the size of an ideal population losing heterozygosity, due to increased relatedness, at the same rate as the actual population.

For stable, large populations NeI and NeV are similar, but when populations are growing or shrinking they can differ greatly.

Genetic drift results in random changes in allele frequencies across generations resulting from sampling of gametes, and thus has larger effects on smaller populations than larger populations. In diploid organisms, heterozygosity (the proportion of individuals with 2 different alleles at a locus) is lost at a rate of 1/(2Ne) at each generation due to genetic drift. As such, the reduction of genetic diversity due to drift has become a management concern for small populations or populations of endangered species because maintenance of genetic diversity is recognized as important for the evolutionary potential of populations to respond to future environmental change.

Extreme and rapid declines of (effective) population size result in population bottlenecks and thereby often rapid reduction in genetic diversity resulting from drift.

Mutation

Mutation is the ultimate source of genetic variation in populations and species. Mutations occur when mistakes are made during DNA replication and can take two main forms:

1) Point mutations

Point mutations are changes of a single nucleotide, such as A to G or G to C. As much of the genome is non-coding, point mutations are often considered “silent” as they do not result in changes in protein structure. Even when point mutations do occur in segments of DNA that code for amino acids (i.e., exons), they often do not result in changes of amino acid sequences (and are called synonymous substitutions) due to the conservation of the genetic code (i.e., 64 possible codons code for only 20 amino acids). Point mutations that do change amino acid sequences are called non-synonymous substitutions.

2) Insertions/Deletions of segments of DNA

Insertions or deletions change the length of a strand of DNA and are hence called frameshift mutations when they occur in exons because they shift the reading frame that is transcribed, often resulting in major changes in amino acid sequences or even premature termination of protein formation.

Migration (Gene Flow)

Gene flow is defined as the movement of genes (via successful mating) among populations. High rates of gene flow tend to homogenize populations, thereby resulting in similar allele frequencies. Conversely, low rates of gene flow tend to result in population isolation and genetic differentiation. In turn, if populations have a small Ne, genetic drift will act to differentiate these populations genetically via random changes in allele frequencies. However, it only takes small amounts of migration (about 1 effective migrant per generation in theory) to overcome drift, which is a relatively weak evolutionary force.

Isolation-by-distance and landscape

The concept of isolation-by distance was first envisioned by Sewall Wright (1943, 1951) and assumes that genetic distance among populations is positively correlated with geographic distance among those populations.

Measuring genetic diversity

Population level

When the population is the unit of analysis, there are four main summary metrics for codominant nDNA loci:

1) number of alleles

2) allelic richness

3) heterozyosity

4) proportion of polymorphic loci

Individual level

There are many metrics used to estimate individual heterozygosity using codominant loci:

  1. Heterozygosity (H)

  2. Standardized heterozygosity (SH)

  3. Internal relatedness (IR)

  4. Homozygosity by loci (HL)

Evaluating genetic structure and detecting barriers

One of the key components of a landscape genetic analysis is evaluating genetic structure (i.e., differentiation) among populations and individuals. When populations are panmictic, or completely randomly mating, there are no discernable differences in allele frequencies and estimates of gene flow become infinite.

Population-based measures and Individual-based genetic distance metrics

The oldest and most widely used population-level metric of genetic structure is Wright’s FST, one of three F statistics derived by Sewall Wright. F-statistics were developed to describe the partitioning of genetic variation within a species. FST is a measure of subpopulation-level genetic differentiation relative to the total population that ranges from 0 (panmixia)to 1 (complete genetic isolation) and measures allele frequency divergence among sub populations.

Key Definitions:

  • Adaptive loci: A locus where the frequency of alleles is affected by selection.

  • Neutral Loci: A locus where the frequency of alleles is not affected by selection.

  • Mutation: A change in the DNA sequence of an organism. Mutations that occur in the DNA of germ cells can be passed on to the next generation.

  • Alleles: A unique genetic variant observed at a particular locus.

  • Locus (Loci): A locus is a particular location in the genome of an organism. The term loci is the plural form of locus.

  • Effective Population Size (Ne): The number of breeding individuals in an idealized population that would show the same amount of change in allele frequencies under random genetic drift or the same amount of inbreeding as the population under consideration.

  • Mitochondiral DNA (mtDNA): A circular DNA molecule of the mitochondrion which is haploid and generally passed on only from mother to offspring.

  • Nuclear DNA: The DNA of the chromosomes found in the nucleus of a cell.

  • Chloroplast DNA (cpDNA): A circular DNA molecule found in the chloroplast of plants.

  • Haploid: An organism or a cell that contains only one set of chromosomes in its genome.

  • Genotype: The combination of alleles observed at a particular locus for a specific individual.

  • Haplotype: An allele or combination of alleles passed on from a single parent.

  • Homozygote: In a diploid organism, an individual that has two copies of the same allele on homologous chromosomes.

  • Heterozygote: In a diploid organism, an individual that has two different alleles at each of its homologous chromosomes.

  • Linkage Disequilibrium: Non-random association of alleles at two or more loci; these alleles tend to be inherited together significantly more often than expected by random chance.

  • Inbreeding: Mating between closely related individuals.

  • Bottlenecks: When a population goes through a period where its effective population size is extremely small, resulting in an increase in the effects of genetic drift and a consequent loss of genetic variation.

  • Synonymous Substitutions: Mutations in coding DNA that do not result in changes in amino acid sequence. Often called “silent substitutions”.

  • Non-Synonymous Substitutions: Mutations in coding DNA that result in changes in amino acid sequence.

  • Demes: A deme is a group of individuals that is sufficiently genetically isolated from other groups of individuals and can also be considered a population.

  • Panmixia: Random mating of individuals within a population or geographic area.

  • Assignment Tests: Methods that use genotypic information to evaluate population membership of sampled individuals.

  • Coalescent: A body of theory that investigates time of divergence from a common ancestor. In population genetics, this theory can be applied to understanding differences in allele frequencies among populations.

  • Likelihood-Based: A quantity that describes the probability of the data in a particular statistical model conditional on the model parameters.

Worked Example 1

A) Goals

This worked example shows how to:

  • Check markers and populations (polymorphism, HWE, linkage, null alleles).

  • Assess genetic diversity.

  • Aggregate genetic data at the population level.

B) Data set

Microsatellite data for 181 individuals of Colombia spotted frogs (Rana luteiventris) from 12 populations. Site-level spatial coordinates and attributes. The data are a subsample of the full data set analyzed in Funk et al. (2005) and Murphy et al. (2010). Please see the separate introduction to the data set.

  • ralu.loci: Data frame with populations and genetic data (181 rows x 9 columns). Included in package ‘LandGenCourse.’ To load it, type: data(ralu.loci)

  • ralu.site: Spatial points data frame with spatial coordinates and site variables Included in package GeNetIt’. To load it, type: data(ralu.site)

C) Required R packages

All required packages should have been installed already when you installed ‘LandGenCourse.’

require(adegenet)
## Loading required package: adegenet
## Loading required package: ade4
## 
##    /// adegenet 2.1.5 is loaded ////////////
## 
##    > overview: '?adegenet'
##    > tutorials/doc/questions: 'adegenetWeb()' 
##    > bug reports/feature requests: adegenetIssues()
require(LandGenCourse)
## Loading required package: LandGenCourse
require(pegas)       
## Loading required package: pegas
## Loading required package: ape
## Registered S3 method overwritten by 'pegas':
##   method      from
##   print.amova ade4
## 
## Attaching package: 'pegas'
## The following object is masked from 'package:ape':
## 
##     mst
## The following object is masked from 'package:ade4':
## 
##     amova
require(sp)
## Loading required package: sp
require(PopGenReport)
## Loading required package: PopGenReport
## Loading required package: knitr
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## Registered S3 method overwritten by 'genetics':
##   method      from 
##   [.haplotype pegas
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(poppr) 
## Loading required package: poppr
## This is poppr version 2.9.3. To get started, type package?poppr
## OMP parallel support: unavailable

Basic checking of markers and populations

Before we do landscape genetic analysis, we need to perform a basic population genetic analysis of the genetic data, in order to better understand the nature and quality of the data and to check for underlying assumptions of population genetic models and corresponding methods.

A) Re-create genind object

Adapted from Week 1 tutorial:

Note: we use the double colon notation ‘package::function(argument)’ to indicate, for each function, which package it belongs to.

data(ralu.loci, package="LandGenCourse")
Frogs <- data.frame(FrogID = paste(substr(ralu.loci$Pop, 1, 3), 
                                   row.names(ralu.loci), sep="."), ralu.loci)
Frogs.genind <- adegenet::df2genind(X=Frogs[,c(4:11)], sep=":", ncode=NULL, 
                          ind.names= Frogs$FrogID, loc.names=NULL, 
                          pop=Frogs$Pop, NA.char="NA", ploidy=2, 
                          type="codom", strata=NULL, hierarchy=NULL)
Frogs.genind
## /// GENIND OBJECT /////////
## 
##  // 181 individuals; 8 loci; 39 alleles; size: 55.5 Kb
## 
##  // Basic content
##    @tab:  181 x 39 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 3-9)
##    @loc.fac: locus factor for the 39 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: adegenet::df2genind(X = Frogs[, c(4:11)], sep = ":", ncode = NULL, 
##     ind.names = Frogs$FrogID, loc.names = NULL, pop = Frogs$Pop, 
##     NA.char = "NA", ploidy = 2, type = "codom", strata = NULL, 
##     hierarchy = NULL)
## 
##  // Optional content
##    @pop: population of each individual (group size range: 7-23)

Check that markers are polymorphic

The genetic resolution depends on the number of markers and their polymorphism. The table above and the summary function for genind objects together provide this information. Now we run the summary function:

summary(Frogs.genind)
## 
## // Number of individuals: 181
## // Group sizes: 21 8 14 13 7 17 9 20 19 13 17 23
## // Number of alleles per locus: 3 4 4 4 9 3 4 8
## // Number of alleles per group: 21 21 20 22 20 19 19 25 18 14 18 26
## // Percentage of missing data: 10.64 %
## // Observed heterozygosity: 0.1 0.4 0.09 0.36 0.68 0.02 0.38 0.68
## // Expected heterozygosity: 0.17 0.47 0.14 0.59 0.78 0.02 0.48 0.74

The output of the summary function shows us the following:

  • 8 loci with 3 - 9 alleles (39 in total)

  • Expected heterozygosity varies between 0.14 (locus C) and 0.78 (locus E)

  • There’s a reasonable level of missing values (10.6%)

C. Check for deviations from Hardy-Weinberg equilibrium (HWE)

See also: http://dyerlab.github.io/applied_population_genetics/hardy-weinberg-equilibrium.html

For a very large population (no drift) with random mating and non-overlapping generations (plus a few more assumptions about the mating system), and in the absence of mutation, migration (gene flow) and selection, we can predict offspring genotype frequencies from allele frequencies of the parent generation (Hardy-Weinberg equilibrium). In general, we don’t expect all of these assumptions to be met (e.g., if we want to study gene flow or selection, we kind of expect that these processes are present). Note: plants often show higher levels of departure from HWE than animals.

Here are p-values for two alternative tests of deviation from HWE for each locus. Columns:

  • chi^2: value of the classical chi-squared test statistic

  • df: degrees of freedom of the chi-squared test

  • Pr(chi^2 >): p-value of the chi-squared test (‘>’ indicates that the alternative is ‘greater,’ which is always the case for a chi-squared test)

  • Pr.exact: p-value from an exact test based on Monte Carlo permutation of alleles (for diploids only). The default is B = 1000 permutations (set B = 0 to skip this test). Here we use the function ‘round’ with argument ‘digits = 3’ to round all values to 3 decimals.

round(pegas::hw.test(Frogs.genind, B = 1000), digits = 3)
##     chi^2 df Pr(chi^2 >) Pr.exact
## A  40.462  3       0.000    0.000
## B  17.135  6       0.009    0.041
## C 136.522  6       0.000    0.000
## D  83.338  6       0.000    0.000
## E 226.803 36       0.000    0.000
## F   0.024  3       0.999    1.000
## G  12.349  6       0.055    0.007
## H  76.813 28       0.000    0.000

Both tests suggest that all loci except for locus “F” are out of HWE globally (across all 181 individuals). Next, we check for HWE of each locus in each population.

Notes on the code: The curly brackets ‘{ }’ below are used to keep the output from multiple lines together in the html file. Function ‘seppop’ splits the genind object by population. We use ‘sapply’ to apply the function ‘hw.test’ from package ‘pegas’ to each population (see this week’s video and tutorial). We set ‘B=0’ to specify that we don’t need any permutations right now. The function ‘t’ takes the transpose of the resulting matrix, which means it flips rows and columns. This works on a matrix, not a data frame, hence we use ‘data.matrix’ to temporarily interpret the data frame as a matrix.

# Chi-squared test: p-value
HWE.test <- data.frame(sapply(seppop(Frogs.genind), 
                              function(ls) pegas::hw.test(ls, B=0)[,3]))
HWE.test.chisq <- t(data.matrix(HWE.test))
{cat("Chi-squared test (p-values):", "\n")
round(HWE.test.chisq,3)}
## Chi-squared test (p-values):
##                A     B     C     D     E     F     G     H
## Airplane   0.092 0.359 1.000 0.427 0.680 1.000 0.178 0.051
## Bachelor   1.000 0.557 0.576 0.686 0.716 1.000 0.414 0.609
## BarkingFox 0.890 0.136 0.005 0.533 0.739 0.890 0.708 0.157
## Bob        0.764 0.864 0.362 0.764 0.033 1.000 0.860 0.287
## Cache      1.000 0.325 0.046 0.659 0.753 1.000 0.709 0.402
## Egg        1.000 0.812 1.000 1.000 0.156 1.000 0.477 0.470
## Frog       1.000 0.719 0.070 0.722 0.587 1.000 0.564 0.172
## GentianL   0.809 0.059 1.000 0.028 0.560 0.717 0.474 0.108
## ParagonL   1.000 0.054 0.885 0.709 0.868 1.000 0.291 0.000
## Pothole    1.000 1.000 1.000 0.488 0.248 1.000 0.296 0.850
## ShipIsland 0.807 0.497 1.000 0.521 0.006 1.000 0.498 0.403
## Skyhigh    0.915 0.493 0.063 0.001 0.155 1.000 0.126 0.078

Let’s repeat this with a Monte Carlo permutation test with B = 1000 replicates:

# Monte Carlo: p-value
HWE.test <- data.frame(sapply(seppop(Frogs.genind), 
                              function(ls) pegas::hw.test(ls, B=1000)[,4]))
HWE.test.MC <- t(data.matrix(HWE.test))
{cat("MC permuation test (p-values):", "\n")
round(HWE.test.MC,3)}
## MC permuation test (p-values):
##                A     B     C     D     E F     G     H
## Airplane   0.015 1.000 1.000 0.415 0.623 1 0.246 0.008
## Bachelor   1.000 0.460 1.000 1.000 0.855 1 0.499 0.603
## BarkingFox 1.000 0.221 0.067 1.000 0.753 1 1.000 0.162
## Bob        1.000 1.000 1.000 1.000 0.019 1 1.000 0.278
## Cache      1.000 0.367 0.131 1.000 1.000 1 1.000 0.611
## Egg        1.000 1.000 1.000 1.000 0.087 1 0.548 0.475
## Frog       1.000 1.000 0.082 1.000 0.464 1 1.000 0.171
## GentianL   1.000 0.065 1.000 0.077 0.649 1 0.621 0.142
## ParagonL   1.000 0.152 1.000 1.000 1.000 1 0.344 0.087
## Pothole    1.000 1.000 1.000 1.000 0.547 1 0.543 1.000
## ShipIsland 1.000 0.611 1.000 0.697 0.141 1 0.510 0.422
## Skyhigh    1.000 0.357 0.182 0.088 0.107 1 0.070 0.032

To summarize, let’s calculate, for each locus, the proportion of populations where it was out of HWE. Here we’ll use the conservative cut-off of alpha = 0.05 for each test. There are various ways of modifying this, including a simple Bonferroni correction, where we divide alpha by the number of tests, which you can activate here by removing the ## i. front of the line.

We write the results into a data frame ‘Prop.loci.out.of.HWE’ and use ‘=’ to specify the name for each column.

alpha=0.05
Prop.loci.out.of.HWE <- data.frame(Chisq=apply(HWE.test.chisq<alpha, 2, mean), 
           MC=apply(HWE.test.MC<alpha, 2, mean))
Prop.loci.out.of.HWE             # Type this line again to see results table
##        Chisq         MC
## A 0.00000000 0.08333333
## B 0.00000000 0.00000000
## C 0.16666667 0.00000000
## D 0.16666667 0.00000000
## E 0.16666667 0.08333333
## F 0.00000000 0.00000000
## G 0.00000000 0.00000000
## H 0.08333333 0.16666667

And similarly, for each population, the proportion of loci that were out of HWE:

Prop.pops.out.of.HWE <- data.frame(Chisq=apply(HWE.test.chisq<alpha, 1, mean), 
           MC=apply(HWE.test.MC<alpha, 1, mean))
Prop.pops.out.of.HWE             
##            Chisq    MC
## Airplane   0.000 0.250
## Bachelor   0.000 0.000
## BarkingFox 0.125 0.000
## Bob        0.125 0.125
## Cache      0.125 0.000
## Egg        0.000 0.000
## Frog       0.000 0.000
## GentianL   0.125 0.000
## ParagonL   0.125 0.000
## Pothole    0.000 0.000
## ShipIsland 0.125 0.000
## Skyhigh    0.125 0.125

The results suggest that:

  • While most loci are out of HWE globally, this is largely explained by subdivision (variation in allele frequencies among local populations indicating limited gene flow).

  • No locus is consistently out of HWE across populations (loci probably not affected by selection).

  • No population is consistently out of HWE across loci (probably no recent major bottlenecks/ founder effects).

Let’s repeat this with ‘false discovery rate’ correction for the number of tests. Here we use the function ‘p.adjust’ with the argument ‘method=“fdr”’ to adjust the p-values from the previous tests. This returns a vector of length 96 (the number of p-values used), which we convert back into a matrix of 12 rows (pops) by 8 columns (loci). Then we proceed as above.

Chisq.fdr <- matrix(p.adjust(HWE.test.chisq,method="fdr"), 
                    nrow=nrow(HWE.test.chisq))
MC.fdr <- matrix(p.adjust(HWE.test.MC, method="fdr"), 
                    nrow=nrow(HWE.test.MC))

Prop.pops.out.of.HWE <- data.frame(Chisq=apply(HWE.test.chisq<alpha, 1, mean), 
           MC=apply(HWE.test.MC<alpha, 1, mean),
           Chisq.fdr=apply(Chisq.fdr<alpha, 1, mean),
           MC.fdr=apply(MC.fdr<alpha, 1, mean))
Prop.pops.out.of.HWE             
##            Chisq    MC Chisq.fdr MC.fdr
## Airplane   0.000 0.250     0.000      0
## Bachelor   0.000 0.000     0.000      0
## BarkingFox 0.125 0.000     0.000      0
## Bob        0.125 0.125     0.000      0
## Cache      0.125 0.000     0.000      0
## Egg        0.000 0.000     0.000      0
## Frog       0.000 0.000     0.000      0
## GentianL   0.125 0.000     0.000      0
## ParagonL   0.125 0.000     0.125      0
## Pothole    0.000 0.000     0.000      0
## ShipIsland 0.125 0.000     0.000      0
## Skyhigh    0.125 0.125     0.125      0

After using false discovery rate correction for the 8 * 12 = 96 tests performed, very few combinations of locus and population were out of HWE based on the chi-squared test, and none with the MC test.

Note: exact results are likely to differ somewhat between runs due to the permutation tests.

D. Check for linkage disequilibrium (LD)

See also: https://grunwaldlab.github.io/Population_Genetics_in_R/Linkage_disequilibrium.html

For microsatellite markers, we typically don’t know where on the genome they are located. The closer together two markers are on a chromosome, the more likely they are inherited together, which means that they don’t really provide independent information. Testing for linkage disequilibrium assesses this, for each pair of loci, by checking whether alleles of two loci are statistically associated.

This step is especially important when developing a new set of markers. You may want to drop (the less informative) one marker of any pair of linked loci.

Here, we start with performing an overall test of linkage disequilibrium (the null hypothesis is that there is no linkage among the set of markers). Two indices are calculated and tested: an index of association (Ia; Brown et al. 1980) and a measure of correlation (rbarD; Agapow and Burt 2001), which is less biased (see URL above). The number of permutations is specified by ‘sample = 199.’

Overall, there is statistically significant association among the markers (p-value: prD = 0.005; also left figure). Recall that the power of a statistical test increases with sample size, and here we have n = 181, hence even a small effect may be statistically significant. Hence we look at effect size, i.e., the actual strength of the pairwise associations (right figure).

poppr::ia(Frogs.genind, sample=199)

##         Ia       p.Ia      rbarD       p.rD 
## 0.33744318 0.00500000 0.05366542 0.00500000
LD.pair <- poppr::pair.ia(Frogs.genind)

LD.pair
##          Ia   rbarD
## A:B  0.0485  0.0492
## A:C -0.0314 -0.0335
## A:D  0.1886  0.1966
## A:E  0.0560  0.0569
## A:F -0.0272 -0.0452
## A:G  0.0931  0.0935
## A:H  0.0294  0.0304
## B:C -0.0329 -0.0375
## B:D  0.0903  0.0911
## B:E  0.0910  0.0910
## B:F -0.0013 -0.0025
## B:G  0.0451  0.0452
## B:H  0.0621  0.0623
## C:D -0.0859 -0.1049
## C:E  0.0247  0.0284
## C:F -0.0311 -0.0397
## C:G -0.0107 -0.0118
## C:H  0.0012  0.0015
## D:E  0.0455  0.0458
## D:F  0.0094  0.0199
## D:G  0.0069  0.0070
## D:H  0.0461  0.0462
## E:F  0.0013  0.0025
## E:G  0.0453  0.0454
## E:H  0.2153  0.2159
## F:G  0.0167  0.0299
## F:H  0.0296  0.0606
## G:H  0.0942  0.0953

The strongest correlation is around 0.2, for markers E and H.

Effect size: If rbarD can be interpreted similarly to a linear correlation coefficient r, that would mean that less than 5% of the variation in one marker is shared with the other marker (recall from stats: the amount of variance explained in regression, Rsquared, is the square of the linear correlation coefficient). This is probably not large enough to worry about.

E. Check for null alleles

See also Dakin and Avise (2004): http://www.nature.com/articles/6800545

One potential drawback for microsatellites as molecular markers is the presence of null alleles that fail to amplify, thus they couldn’t be detected in the PCR assays.

The function ‘null.all’ takes a genind object and returns a list with two components (‘homozygotes’ and ‘null.allele.freq’), and each of these is again a list. See ‘?null.all’ for details and choice of method.

List ‘homozygotes’:

  • homozygotes$observed: observed number of homozygotes for each allele at each locus

  • homozygotes$bootstrap: distribution of the expected number of homozygotes

  • homozygotes$probability.obs: probability of observing the number of homozygotes

Note: we are turning off warnings here (currently the code throws a warning for each sample, though results don’t seem to be affected). Code takes a moment to run.

# Null alleles: depends on method! See help file.
Null.alleles <- PopGenReport::null.all(Frogs.genind)
Null.alleles$homozygotes$probability.obs
##   Allele-1 Allele-2 Allele-3 Allele-4 Allele-5 Allele-6 Allele-7 Allele-8
## A    0.098    0.000    0.027       NA       NA       NA       NA       NA
## B    0.151    0.045    0.072    0.003       NA       NA       NA       NA
## C    0.182    0.003    0.000    0.005       NA       NA       NA       NA
## D    0.000    0.000    0.196    0.004       NA       NA       NA       NA
## E    0.055    0.013    0.020    0.266    0.064    0.043    0.742    0.531
## F    0.466    0.001    0.020       NA       NA       NA       NA       NA
## G    0.074    0.087    0.016    0.001       NA       NA       NA       NA
## H    0.452    0.119    0.009    0.023    0.260    0.307    0.007    0.002
##   Allele-9
## A       NA
## B       NA
## C       NA
## D       NA
## E        0
## F       NA
## G       NA
## H       NA

List ‘null.allele.freq’:

  • null.allele.freq$summary1: null allele frequency estimates based upon the forumulas of Chakraborty et al. (1994)

  • null.allele.freq$summary2: null allele frequency estimates based upon the forumulas of Brookfield (1996)

From the help file: “Brookfield (1996) provides a brief discussion on which estimator should be used. In summary, it was recommended that Chakraborty et al. (1994)’s method (e.g. summary1) be used if there are individuals with no bands at a locus seen, but they are discounted as possible artefacts. If all individuals have one or more bands at a locus then Brookfield (1996)’s method (e.g. summary2) should be used.” In this case, we have many individuals with missing values for both alleles, hence better use summary1.

Each summary table contains a summary with observed, median, 2.5th percentile and 97.5the percentile. The percentiles form a 95% confidence interval. From the help file: “If the 95% confidence interval includes zero, it indicates that the frequency of null alleles at a locus does not significantly differ from zero.”

{cat(" summary1 (Chakraborty et al. 1994):", "\n")
round(Null.alleles$null.allele.freq$summary1,2)} 
##  summary1 (Chakraborty et al. 1994):
##                       A    B    C    D    E     F    G    H
## Observed frequency 0.24 0.08 0.23 0.25 0.06  0.00 0.11 0.04
## Median frequency   0.24 0.08 0.22 0.25 0.06  0.00 0.11 0.04
## 2.5th percentile   0.08 0.02 0.03 0.17 0.02 -0.01 0.01 0.00
## 97.5th percentile  0.42 0.16 0.48 0.33 0.11  0.00 0.22 0.09
{cat("summary2 (Brookfield et al. 1996):", "\n")
round(Null.alleles$null.allele.freq$summary2,2)}   
## summary2 (Brookfield et al. 1996):
##                       A    B    C    D    E F    G    H
## Observed frequency 0.06 0.05 0.05 0.17 0.05 0 0.07 0.04
## Median frequency   0.06 0.05 0.05 0.17 0.05 0 0.07 0.03
## 2.5th percentile   0.02 0.01 0.01 0.12 0.02 0 0.01 0.00
## 97.5th percentile  0.10 0.10 0.10 0.23 0.10 0 0.13 0.08

For this example, both methods suggest that there may be null alleles in most loci. However, the estimates of the frequency of null alleles differ a lot between the two methods.

A different approach for estimating null alleles at microsatellite loci, based on the Estimation-Maximization algorithm, is implemented in FreeNA (outside of the R environment). FreeNA will directly provide Fst values and some other measurements using the corrected allele frequencies: https://www1.montpellier.inra.fr/CBGP/software/FreeNA/

Relevant papers for the Estimation-Maximization algorithm:

  • Kalinowski et al. (2006), Conservation Genetics 7:991–995, doi: 10.1007/s10592-006-9134-9.

  • Chapuis and Estoup (2007), Mol. Biol. Evol. 24:621–631, doi: 10.1093/molbev/msl191.

F. Overall interpretation

Spatial genetic structure: The Columbia spotted frog data used in this lab come from a study area with a great deal of genetic structure. If we use a population assignment test (see Week 9), each basin is a separate unit with significant substructure within basins. Testing for significant genetic distance, most pairs of ponds have a genetic distance that is significantly different from zero.

HWE: Therefore, we expect global estimates (i.e., the whole dataset) of He to be out of HWE due to population substructure (HWE assumes panmictic populations). We would also expect data to be out of HWE when analyzing data by basin due to population substructure. We could, then, test HWE and linkage disequilibrium at the pond level (as shown here). However, some ponds have low sample sizes (which refleect a low number of individuals: based on repeated surveys of sites, most if not nearly all animals were captured).

Linkage: These low samples sizes can result in deviations from HWE and in linkage disequilibrium as an artifact of sample size and/or breeding structure (if one male is responsible for all breeding, his genes would appear “linked” in offspring genotypes, even though they are not physically linked in the genome). In addition, each pond may not represent a “population,” but only a portion of a population.

So, what do we do? We can look for patterns. Are there loci that are consistently out of HWE across samples sites while other loci are not out of HWE suggesting that there are null alleles or other data quality control issues? With the full data set, this was not the case. Are there loci that are consistently linked across different ponds (while others are not), suggesting that they are linked? With the full dataset, this was not the case.

Null alleles: Missing data can imply null (non-amplifying) alleles. In this case, while there are loci that have higher “drop-out” rates than others, this is more likely due to some loci being more difficult to call (the authors were very strict in removing any questionable data; equivocal calls resulting in no data). In addition, some of the samples used in this study were toe clips which with low yields of DNA, resulting in incomplete genotypes. While presence of null alleles is a possibility, genetic structure, breeding patterns at low population size and aggressive quality control of genotypes can all explain the results. Finally, with all of the structure in these data, there are examples (at the basin level and pond level) of unique alleles that are fixed or nearly fixed. When the data are assessed globally, this will result in a similar pattern to the presence of null alleles and will result in positive null allele tests.

Assess genetic diversity

These measures are typically quantified per population.

A. Rarefied allelic richness

Both nominal sample size (number of frogs sampled) and valid sample size (e.g., for each locus, the number of frogs with non-missing genetic data) vary between sites. We would expect the number of alleles found in a population to increase with the number of individuals genotyped.

We can check this by plotting the number of alleles against sample size. Here we create an object ‘Sum’ that contains the summary of the genind object, then we can access its elements by ‘$’ to plot what we need. The function ‘names’ lists the names of the elements, which reduced the guesswork.

Sum <- summary(Frogs.genind)
names(Sum)
## [1] "n"         "n.by.pop"  "loc.n.all" "pop.n.all" "NA.perc"   "Hobs"     
## [7] "Hexp"

The site names are quite long, hence we print the labels vertically by setting ‘las=3,’ and we modify the margins (‘mar’). The four numbers give the size of each margin in the following order: bottom, left, top, right.

We add a regression line to the scatterplot with the function ‘abline,’ where we specify the linear regression model with the function ‘lm.’ In this case, we model the response ‘pop.n.all’ as a function of predictor ‘n.by.pop.’

The barchart (left) shows that there is considerable variation among ponds in the number of alleles observed across all loci. The scatterplot (right) with the red regression line shows that the number of alleles increases with sample size.

par(mar=c(5.5, 4.5,1,1))
barplot(Sum$pop.n.all, las=3, 
       xlab = "", ylab = "Number of alleles")

plot(Sum$n.by.pop, Sum$pop.n.all, 
       xlab = "Sample size", ylab = "Number of alleles")
abline(lm(Sum$pop.n.all ~ Sum$n.by.pop), col = "red")

Hence we should not compare the number of alleles directly. Instead, we’ll use rarefied allelic richness (Ar).

By default, the function ‘allel.rich’ finds the lowest valid sample size across all populations and loci, and multiplies it by the ploidy level. The number is stored as ‘Richness$alleles.sampled’ (here: 3 individuals * 2 alleles = 6 alleles). Alternatively, this number can be set with the ‘min.alleles’ argument.

Richness <- PopGenReport::allel.rich(Frogs.genind, min.alleles = NULL)
Richness$alleles.sampled
## [1] 6

Populations with more alleles are resampled to determine the average allelic richness among the minimum number of alleles. Here, this means that 6 alleles are sampled from each population, allelic richness is calculated, and the process is repeated many times to determine the average).

Let’s plot the results again. The barchart shows that there is considerable variation in genetic diversity among ponds. The scatterplot against sample size (here: for each population, the average number of valid alleles across loci) suggests that the variation is not related to sample size. The regression line (red) is almost horizontal.

Here we plot the average Ar across loci, so that the result does not depend on the number of loci used.

par(mar=c(5.5, 4.5,1,1))
barplot(Richness$mean.richness, las=3, ylab="Rarefied allelic richness (Ar)")

plot(colMeans(Richness$pop.sizes), Richness$mean.richness,
     xlab="Valid sample size", 
     ylab="Rarefied allelic richness (Ar)")
abline(lm(Richness$mean.richness ~ colMeans(Richness$pop.sizes)), col="red")

B. Observed and expected heterozygosity

Note: Writing the ‘genind’ summary into an object ‘Sum’ allows accessing its attributes by name.

  Sum <- summary(Frogs.genind)
  names(Sum)
## [1] "n"         "n.by.pop"  "loc.n.all" "pop.n.all" "NA.perc"   "Hobs"     
## [7] "Hexp"

Expected heterozygosity (here: Hexp) is the heterozygosity expected in a population under HWE, and observed heterozygosity (here: Hobs) is the observed number of heterozygotes at a locus divided by the total number of genotyped individuals. Here are the global values (pooled across all populations):

  par(mar=c(3, 4.5,1,1))
  barplot(Sum$Hexp, ylim=c(0,1), ylab="Expected heterozygosity")

  barplot(Sum$Hobs, ylim=c(0,1), ylab="Observed heterozygosity")

By locus and population:

Here we use ‘seppop’ to split the genind object by population, then ‘sapply’ to apply function ‘summary’ to each population.

  Hobs <- t(sapply(seppop(Frogs.genind), function(ls) summary(ls)$Hobs))
  Hexp <- t(sapply(seppop(Frogs.genind), function(ls) summary(ls)$Hexp))
  {cat("Expected heterozygosity (Hexp):", "\n")
  round(Hexp, 2)
  cat("\n", "Observed heterozygosity (Hobs):", "\n")
  round(Hobs, 2)}
## Expected heterozygosity (Hexp): 
## 
##  Observed heterozygosity (Hobs):
##               A    B    C    D    E    F    G    H
## Airplane   0.25 0.33 0.00 0.33 0.62 0.00 0.16 0.74
## Bachelor   0.00 0.38 0.40 0.25 0.88 0.00 0.33 0.75
## BarkingFox 0.07 0.14 0.00 0.57 0.79 0.07 0.22 0.23
## Bob        0.15 0.38 0.42 0.15 0.92 0.00 0.11 0.67
## Cache      0.00 0.83 0.00 0.29 1.00 0.00 0.40 0.86
## Egg        0.00 0.41 0.00 0.00 0.87 0.00 0.36 0.87
## Frog       0.00 0.38 0.14 0.56 0.86 0.00 0.33 1.00
## GentianL   0.11 0.95 0.00 0.74 0.90 0.15 0.56 0.84
## ParagonL   0.00 0.11 0.08 0.16 0.33 0.00 0.36 0.47
## Pothole    0.00 0.00 0.00 0.33 0.50 0.00 0.57 0.73
## ShipIsland 0.41 0.41 0.00 0.59 0.53 0.00 0.36 0.71
## Skyhigh    0.04 0.57 0.11 0.30 0.52 0.00 0.77 0.59

Locus F shows variation only in two populations (i.e., Hexp = 0 in 10 populations).

Let’s plot the average across all loci for each population:

Here we use ‘apply’ to apply the function ‘mean’ to the rows (MARGIN = 1). For columns, use ‘MARGIN = 2.’

  par(mar=c(5.5, 4.5, 1, 1))
  Hobs.pop <- apply(Hobs, MARGIN = 1, FUN = mean)
  Hexp.pop <- apply(Hexp, 1, mean) 
  barplot(Hexp.pop, ylim=c(0,1), las=3, ylab="Expected heterozygosity")

  barplot(Hobs.pop, ylim=c(0,1), las=3, ylab="Observed heterozygosity")

C. Create table with sitel-level genetic diversity measures

Frogs.diversity <- data.frame(Pop = names(Hobs.pop),
                              n = Sum$n.by.pop,
                              Hobs = Hobs.pop,
                              Hexp = Hexp.pop,
                              Ar = Richness$mean.richness)
Frogs.diversity
##                   Pop  n      Hobs      Hexp       Ar
## Airplane     Airplane 21 0.3038064 0.3433019 1.939629
## Bachelor     Bachelor  8 0.3729167 0.3651953 2.003673
## BarkingFox BarkingFox 14 0.2619811 0.2818609 1.741904
## Bob               Bob 13 0.3504274 0.3171011 1.961791
## Cache           Cache  7 0.4220238 0.3923413 2.085115
## Egg               Egg 17 0.3135918 0.2951215 1.841382
## Frog             Frog  9 0.4079861 0.3659439 1.925469
## GentianL     GentianL 20 0.5296418 0.4529000 2.331599
## ParagonL     ParagonL 19 0.1880302 0.2200536 1.539923
## Pothole       Pothole 13 0.2665043 0.2328137 1.597687
## ShipIsland ShipIsland 17 0.3755252 0.3918983 1.949255
## Skyhigh       Skyhigh 23 0.3632542 0.3552167 1.953610

You can save the R object ‘Frogs.diversity’ with the code below (need to uncomment by removing the hashtags ‘#’):

#require(here)
#if(!dir.exists(paste0(here(),"/output"))) dir.create(paste0(here(),"/output"))
#save(Frogs.diversity, file = paste0(here(),"/output/Frogs.diversity.RData"))
#load(paste0(here(),"/output/Frogs.diversity.RData"))

4. Aggregate genetic data at population level (allele frequencies)

For some analyses, we will need to aggregate data from the individual to the population level, e.g. as a table of allele frequencies per population.

Here we convert the ‘genind’ object to a ‘genpop’ object (NOT the same as a ‘genepop’ object!). This is defined in the package ‘adegenet’ to hold population-level genetic data. The function ‘genind2genpop’ obviously converts from ‘genind’ to ‘genpop.’

Frogs.genpop <- adegenet::genind2genpop(Frogs.genind)
## 
##  Converting data from a genind to a genpop object... 
## 
## ...done.

The function ‘makefreq’ extracts the table with allele frequencies from the ‘genpop’ object. We’ll plot just a few lines and alleles.

Freq <- adegenet::makefreq(Frogs.genpop)
## 
##  Finding allelic frequencies from a genpop object... 
## 
## ...done.
round(Freq[1:6,1:10], 2)
##             A.1  A.2  A.3  B.1  B.3  B.2  B.4  C.1  C.2 C.4
## Airplane   0.62 0.35 0.03 0.83 0.17 0.00 0.00 1.00 0.00   0
## Bachelor   1.00 0.00 0.00 0.56 0.06 0.38 0.00 0.80 0.20   0
## BarkingFox 0.96 0.00 0.04 0.86 0.04 0.11 0.00 0.88 0.12   0
## Bob        0.92 0.00 0.08 0.81 0.00 0.12 0.08 0.79 0.21   0
## Cache      1.00 0.00 0.00 0.42 0.00 0.50 0.08 0.75 0.25   0
## Egg        1.00 0.00 0.00 0.74 0.00 0.26 0.00 1.00 0.00   0

The allele frequencies of all alleles from the same locus (e.g., A.1, A.2 and A.3) should sum to 1 for each population. With eight loci, the row sums should thus add to 8.

apply(Freq, MARGIN = 1, FUN = sum)    # Just checking
##   Airplane   Bachelor BarkingFox        Bob      Cache        Egg       Frog 
##          8          8          8          8          8          8          8 
##   GentianL   ParagonL    Pothole ShipIsland    Skyhigh 
##          8          8          8          8          8

Lab Example 2

In this week of Landscape genetics we will familiarize ourselves with common genetic diversity analyses using the Kentucky Arrow Darter set used in Blanton et al 2019.

First we will load in the neccesary packages (if not loaded in above):

library(adegenet)
library(gstudio)
library(LandGenCourse)
library(tibble)
library(here)
library(vcfR)
library(pinfsc50)
library(utils)

Next, we will need to load in the Kentucky Arrow Darter genetic data.

kad_data <- read.csv("~/Desktop/landscape_genetics_2022/labs/landgen_lab2_geneticdiversity/kad_dataset.csv")

A. Re-create genind object

Kad.genind <- df2genind(X=kad_data[,c(3:13)], sep=":", ncode=NULL, ind.names= kad_data$Useful_ID, loc.names=NULL, pop=kad_data$Population_ID, NA.char="NA", ploidy=2, type="codom", strata=NULL, hierarchy=NULL)

Kad.genind
## /// GENIND OBJECT /////////
## 
##  // 213 individuals; 11 loci; 117 alleles; size: 138.6 Kb
## 
##  // Basic content
##    @tab:  213 x 117 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 3-31)
##    @loc.fac: locus factor for the 117 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: df2genind(X = kad_data[, c(3:13)], sep = ":", ncode = NULL, ind.names = kad_data$Useful_ID, 
##     loc.names = NULL, pop = kad_data$Population_ID, NA.char = "NA", 
##     ploidy = 2, type = "codom", strata = NULL, hierarchy = NULL)
## 
##  // Optional content
##    @pop: population of each individual (group size range: 9-31)

B. Check that markers are polymorphic

The genetic resolution depends on the number of markers and their polymorphism. The table above and the summary function for genind objects together provide this information. Now we run the summary function:

summary(Kad.genind)
## 
## // Number of individuals: 213
## // Group sizes: 24 19 26 24 9 27 14 31 18 21
## // Number of alleles per locus: 5 3 10 8 10 10 10 11 31 16 3
## // Number of alleles per group: 60 29 51 25 14 41 41 44 62 58
## // Percentage of missing data: 2.94 %
## // Observed heterozygosity: 0.3 0.04 0.45 0.42 0.63 0.51 0.7 0.49 0.69 0.6 0.22
## // Expected heterozygosity: 0.54 0.19 0.63 0.55 0.83 0.84 0.88 0.82 0.94 0.88 0.48

The output of the summary function shows us the following:

11 loci with 3 - 31 alleles (117 in total) Expected heterozygosity varies between 0.19 and 0.94 There is a small amount of missing values (2.94%)

C. Check for deviations from Hardy-Weinberg equilibrium (HWE)

See the first lab example for explanations of HWE

The default is B = 1000 permutations (set B = 0 to skip this test). Here we use the function ‘round’ with argument ‘digits = 3’ to round all values to 3 decimals.

round(pegas::hw.test(Kad.genind, B = 1000), digits = 3)
##         chi^2  df Pr(chi^2 >) Pr.exact
## C2     89.599  10           0        0
## C3    345.456   3           0        0
## C6    323.814  45           0        0
## C117  434.427  28           0        0
## D131  142.252  45           0        0
## C26b  529.005  45           0        0
## Cv12  197.543  45           0        0
## D10   307.827  55           0        0
## D11  1402.630 465           0        0
## D107  591.560 120           0        0
## D116   63.818   3           0        0

Both tests suggest that all loci are out of HWE globally (across all 213 individuals). This is indicated by the 0 values (they would be 1 if they were in HWE)

Next, we check for HWE of each locus in each population.

Notes on the code: The curly brackets ‘{ }’ below are used to keep the output from multiple lines together in the html file. Function ‘seppop’ splits the genind object by population. We use ‘sapply’ to apply the function ‘hw.test’ from package ‘pegas’ to each population (see this week’s video and tutorial). We set ‘B=0’ to specify that we don’t need any permutations right now. The function ‘t’ takes the transpose of the resulting matrix, which means it flips rows and columns. This works on a matrix, not a data frame, hence we use ‘data.matrix’ to temporarily interpret the data frame as a matrix.

Chi-squared test: p-value

HWE.test <- data.frame(sapply(seppop(Kad.genind), 
                              function(ls) pegas::hw.test(ls, B=0)[,3]))
HWE.test.chisq <- t(data.matrix(HWE.test))
{cat("Chi-squared test (p-values):", "\n")
round(HWE.test.chisq,3)}
## Chi-squared test (p-values):
##        C2    C3    C6  C117  D131  C26b  Cv12   D10   D11  D107  D116
## X1  0.772 1.000 0.285 0.891 0.547 0.889 0.331 0.000 0.959 0.003 0.392
## X2  1.000 0.498 0.243 0.356 0.678 1.000 0.865 0.000 0.000 0.001 1.000
## X3  0.929 1.000 0.980 0.690 0.425 0.430 0.748 0.464 0.452 0.830 0.190
## X4  0.744 1.000 0.744 1.000 0.744 0.484 0.554 0.910 0.741 0.917 1.000
## X5  1.000 0.003 1.000 1.000 0.987 1.000 1.000 1.000 1.000 1.000 1.000
## X6  1.000 0.000 0.640 0.096 0.866 0.048 0.648 0.182 0.488 0.313 0.760
## X7  0.079 1.000 0.033 0.360 0.105 0.110 0.886 0.962 0.030 0.437 0.684
## X8  1.000 0.000 0.000 0.000 0.720 0.655 0.644 0.764 0.927 0.056 1.000
## X9  0.528 0.000 0.017 0.001 0.872 0.637 0.712 0.009 0.552 0.003 0.494
## X10 0.589 0.000 0.007 0.740 0.693 0.873 0.183 0.036 0.000 0.704 0.992

Let’s repeat this with a Monte Carlo permutation test with B = 1000 replicates:

Monte Carlo: p-value

HWE.test <- data.frame(sapply(seppop(Kad.genind), 
                              function(ls) pegas::hw.test(ls, B=1000)[,4]))
HWE.test.MC <- t(data.matrix(HWE.test))
{cat("MC permuation test (p-values):", "\n")
round(HWE.test.MC,3)}
## MC permuation test (p-values):
##        C2    C3    C6  C117  D131  C26b  Cv12   D10   D11  D107  D116
## X1  0.704 1.000 0.366 1.000 0.423 1.000 0.413 0.000 0.933 0.352 0.706
## X2  1.000 0.637 0.178 0.645 0.638 1.000 1.000 0.021 0.107 0.231 1.000
## X3  0.798 1.000 0.846 0.950 0.078 0.343 0.684 0.144 0.315 0.564 0.199
## X4  1.000 1.000 1.000 1.000 1.000 1.000 0.618 1.000 0.674 1.000 1.000
## X5  1.000 0.064 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## X6  1.000 0.001 0.818 0.229 0.385 0.115 0.707 0.068 0.295 0.208 1.000
## X7  0.036 1.000 0.043 0.575 0.069 0.109 0.920 0.815 0.229 0.253 0.783
## X8  1.000 0.001 0.030 0.004 0.862 1.000 0.522 0.629 0.453 0.003 1.000
## X9  0.559 0.036 0.185 0.008 0.748 0.393 0.495 0.021 0.603 0.148 1.000
## X10 0.643 0.000 0.097 1.000 0.700 0.949 0.045 0.210 0.000 0.862 1.000

To summarize, let’s calculate, for each locus, the proportion of populations where it was out of HWE. Here we’ll use the conservative cut-off of alpha = 0.05 for each test. There are various ways of modifying this, including a simple Bonferroni correction, where we divide alpha by the number of tests, which you can activate here by removing the ## i. front of the line.

We write the results into a data frame ‘Prop.loci.out.of.HWE’ and use ‘=’ to specify the name for each column.

alpha=0.05
Prop.loci.out.of.HWE <- data.frame(Chisq=apply(HWE.test.chisq<alpha, 2, mean), 
           MC=apply(HWE.test.MC<alpha, 2, mean))
Prop.loci.out.of.HWE 
##      Chisq  MC
## C2     0.0 0.1
## C3     0.5 0.4
## C6     0.4 0.2
## C117   0.2 0.2
## D131   0.0 0.0
## C26b   0.1 0.0
## Cv12   0.0 0.1
## D10    0.4 0.3
## D11    0.3 0.1
## D107   0.3 0.1
## D116   0.0 0.0

And similarly, for each population, the proportion of loci that were out of HWE:

Prop.pops.out.of.HWE <- data.frame(Chisq=apply(HWE.test.chisq<alpha, 1, mean), 
           MC=apply(HWE.test.MC<alpha, 1, mean))
Prop.pops.out.of.HWE 
##          Chisq         MC
## X1  0.18181818 0.09090909
## X2  0.27272727 0.09090909
## X3  0.00000000 0.00000000
## X4  0.00000000 0.00000000
## X5  0.09090909 0.00000000
## X6  0.18181818 0.09090909
## X7  0.18181818 0.18181818
## X8  0.27272727 0.36363636
## X9  0.45454545 0.27272727
## X10 0.36363636 0.27272727

The results suggest that:

  • While most loci are out of HWE globally, this is largely explained by subdivision (variation in allele frequencies among local populations indicating limited gene flow).

  • No locus is consistently out of HWE across populations (loci probably not affected by selection).

  • Some populations are somewhat consistently out of HWE across loci, indicating possible recent major bottlenecks/ founder effects (Populations 9 and 10).

Let’s repeat this with ‘false discovery rate’ correction for the number of tests. Here we use the function ‘p.adjust’ with the argument ‘method=“fdr”’ to adjust the p-values from the previous tests. This returns a vector of length 96 (the number of p-values used), which we convert back into a matrix of 12 rows (pops) by 8 columns (loci). Then we procede as above.

Chisq.fdr <- matrix(p.adjust(HWE.test.chisq,method="fdr"), 
                    nrow=nrow(HWE.test.chisq))
MC.fdr <- matrix(p.adjust(HWE.test.MC, method="fdr"), 
                    nrow=nrow(HWE.test.MC))

Prop.pops.out.of.HWE <- data.frame(Chisq=apply(HWE.test.chisq<alpha, 1, mean), 
           MC=apply(HWE.test.MC<alpha, 1, mean),
           Chisq.fdr=apply(Chisq.fdr<alpha, 1, mean),
           MC.fdr=apply(MC.fdr<alpha, 1, mean))
Prop.pops.out.of.HWE
##          Chisq         MC  Chisq.fdr     MC.fdr
## X1  0.18181818 0.09090909 0.18181818 0.09090909
## X2  0.27272727 0.09090909 0.27272727 0.00000000
## X3  0.00000000 0.00000000 0.00000000 0.00000000
## X4  0.00000000 0.00000000 0.00000000 0.00000000
## X5  0.09090909 0.00000000 0.09090909 0.00000000
## X6  0.18181818 0.09090909 0.09090909 0.09090909
## X7  0.18181818 0.18181818 0.00000000 0.00000000
## X8  0.27272727 0.36363636 0.27272727 0.09090909
## X9  0.45454545 0.27272727 0.27272727 0.00000000
## X10 0.36363636 0.27272727 0.27272727 0.18181818

After using false discovery rate correction for the tests performed, populations 9 and 10 showed a small amount of combinations of locus and populations were out of HWE based on the chi-squared test and MC test.

  • Note: exact results are likely to differ somewhat between runs due to the permutation tests.

D. Check for linkage disequilibrium (LD)

For microsatellite markers, we typically don’t know where on the genome they are located. The closer together two markers are on a chromosome, the more likely they are inherited together, which means that they don’t really provide independent information. Testing for linkage disequilibrium assesses this, for each pair of loci, by checking whether alleles of two loci are statistically associated.

This step is especially important when developing a new set of markers. You may want to drop (the less informative) one marker of any pair of linked loci.

Here, we start with performing an overall test of linkage disequilibrium (the null hypothesis is that there is no linkage among the set of markers). Two indices are calculated and tested: an index of association (Ia; Brown et al. 1980) and a measure of correlation (rbarD; Agapow and Burt 2001), which is less biased (see URL above). The number of permutations is specified by ‘sample = 213.’

Overall, there is no statistical significant association among the markers (p-value: prD = 0.0046; also left figure). Recall that the power of a statistical test increases with sample size, and here we have n = 213, hence even a small effect may be statistically significant. Hence we look at effect size, i.e., the actual strength of the pairwise associations (right figure).

poppr::ia(Kad.genind, sample=213)

##          Ia        p.Ia       rbarD        p.rD 
## 1.154575532 0.004672897 0.117657897 0.004672897
LD.pair <- poppr::pair.ia(Kad.genind)

LD.pair
##                Ia   rbarD
## C2:C3      0.0850  0.0853
## C2:C6      0.0461  0.0463
## C2:C117    0.0787  0.0788
## C2:D131    0.1148  0.1166
## C2:C26b    0.2626  0.2650
## C2:Cv12    0.1495  0.1496
## C2:D10     0.1711  0.1712
## C2:D11     0.1162  0.1331
## C2:D107    0.0749  0.0750
## C2:D116    0.0589  0.0589
## C3:C6      0.0733  0.0733
## C3:C117   -0.0152 -0.0153
## C3:D131    0.0625  0.0628
## C3:C26b    0.0245  0.0245
## C3:Cv12    0.0630  0.0630
## C3:D10    -0.1169 -0.1170
## C3:D11     0.0039  0.0043
## C3:D107   -0.1126 -0.1136
## C3:D116    0.0884  0.0889
## C6:C117    0.0664  0.0665
## C6:D131    0.1312  0.1318
## C6:C26b    0.1781  0.1783
## C6:Cv12    0.1143  0.1144
## C6:D10     0.1518  0.1521
## C6:D11     0.0781  0.0859
## C6:D107    0.0995  0.1005
## C6:D116    0.0723  0.0728
## C117:D131  0.1422  0.1434
## C117:C26b  0.1136  0.1140
## C117:Cv12  0.1513  0.1513
## C117:D10   0.1390  0.1390
## C117:D11   0.0973  0.1089
## C117:D107  0.1131  0.1137
## C117:D116  0.0186  0.0186
## D131:C26b  0.2578  0.2580
## D131:Cv12  0.2358  0.2381
## D131:D10   0.1697  0.1715
## D131:D11   0.1156  0.1229
## D131:D107  0.1933  0.1985
## D131:D116  0.1302  0.1329
## C26b:Cv12  0.1850  0.1858
## C26b:D10   0.3152  0.3168
## C26b:D11   0.2025  0.2185
## C26b:D107  0.2181  0.2220
## C26b:D116  0.0489  0.0495
## Cv12:D10   0.1811  0.1811
## Cv12:D11   0.0657  0.0737
## Cv12:D107  0.2285  0.2295
## Cv12:D116  0.1802  0.1806
## D10:D11    0.1321  0.1488
## D10:D107   0.2548  0.2557
## D10:D116   0.0912  0.0914
## D11:D107   0.1566  0.1840
## D11:D116   0.0720  0.0835
## D107:D116  0.0545  0.0545

The strongest correlation is around 0.3, for markers C26b and D10.

E. Check for null alleles

See above section for more information regarding null alleles

Note: we are turning off warnings here (currently the code throws a warning for each sample, though results don’t seem to be affected).

Null.alleles <- PopGenReport::null.all(Kad.genind)
Null.alleles$homozygotes$probability.obs
##      Allele-1 Allele-2 Allele-3 Allele-4 Allele-5 Allele-6 Allele-7 Allele-8
## C2      0.001    0.000    0.000    0.018    0.005       NA       NA       NA
## C3      0.001    0.000    0.000       NA       NA       NA       NA       NA
## C6      0.007    0.003    0.015    0.015    0.007    0.017    0.000    0.000
## C117    0.021    0.491    0.005    0.000    0.000    0.000    0.012    0.002
## D131    0.002    0.002    0.002    0.002    0.020    0.000    0.013    0.081
## C26b    0.000    0.265    0.000    0.042    0.320    0.000    0.000    0.000
## Cv12    0.164    0.000    0.107    0.011    0.375    0.003    0.000    0.006
## D10     0.000    0.001    0.000    0.026    0.000    0.338    0.008    0.000
## D11     0.005    0.028    0.020    0.112    0.006    0.002    0.023    0.042
## D107    0.030    0.005    0.000    0.000    0.001    0.276    0.000    0.004
## D116    0.000    0.000    0.011       NA       NA       NA       NA       NA
##      Allele-9 Allele-10 Allele-11 Allele-12 Allele-13 Allele-14 Allele-15
## C2         NA        NA        NA        NA        NA        NA        NA
## C3         NA        NA        NA        NA        NA        NA        NA
## C6      0.005     0.005        NA        NA        NA        NA        NA
## C117       NA        NA        NA        NA        NA        NA        NA
## D131    0.003     0.003        NA        NA        NA        NA        NA
## C26b    0.041     0.000        NA        NA        NA        NA        NA
## Cv12    0.194     0.000        NA        NA        NA        NA        NA
## D10     0.001     0.000     0.002        NA        NA        NA        NA
## D11     0.001     0.000     0.008     0.039     0.012     0.040     0.008
## D107    0.014     0.002     0.000     0.100     0.000     0.021     0.001
## D116       NA        NA        NA        NA        NA        NA        NA
##      Allele-16 Allele-17 Allele-18 Allele-19 Allele-20 Allele-21 Allele-22
## C2          NA        NA        NA        NA        NA        NA        NA
## C3          NA        NA        NA        NA        NA        NA        NA
## C6          NA        NA        NA        NA        NA        NA        NA
## C117        NA        NA        NA        NA        NA        NA        NA
## D131        NA        NA        NA        NA        NA        NA        NA
## C26b        NA        NA        NA        NA        NA        NA        NA
## Cv12        NA        NA        NA        NA        NA        NA        NA
## D10         NA        NA        NA        NA        NA        NA        NA
## D11      0.002     0.022     0.029         0     0.068     0.108     0.001
## D107     0.002        NA        NA        NA        NA        NA        NA
## D116        NA        NA        NA        NA        NA        NA        NA
##      Allele-23 Allele-24 Allele-25 Allele-26 Allele-27 Allele-28 Allele-29
## C2          NA        NA        NA        NA        NA        NA        NA
## C3          NA        NA        NA        NA        NA        NA        NA
## C6          NA        NA        NA        NA        NA        NA        NA
## C117        NA        NA        NA        NA        NA        NA        NA
## D131        NA        NA        NA        NA        NA        NA        NA
## C26b        NA        NA        NA        NA        NA        NA        NA
## Cv12        NA        NA        NA        NA        NA        NA        NA
## D10         NA        NA        NA        NA        NA        NA        NA
## D11          0     0.002         0     0.005     0.052         0     0.019
## D107        NA        NA        NA        NA        NA        NA        NA
## D116        NA        NA        NA        NA        NA        NA        NA
##      Allele-30 Allele-31
## C2          NA        NA
## C3          NA        NA
## C6          NA        NA
## C117        NA        NA
## D131        NA        NA
## C26b        NA        NA
## Cv12        NA        NA
## D10         NA        NA
## D11      0.004     0.002
## D107        NA        NA
## D116        NA        NA

We will use summary one (discussed in lab example 1) since we have individuals missing both alleles

{cat(" summary1 (Chakraborty et al. 1994):", "\n")
round(Null.alleles$null.allele.freq$summary1,2)} 
##  summary1 (Chakraborty et al. 1994):
##                      C2   C3   C6 C117 D131 C26b Cv12  D10  D11 D107 D116
## Observed frequency 0.29 0.66 0.16 0.13 0.14 0.25 0.11 0.25 0.15 0.19 0.37
## Median frequency   0.29 0.67 0.16 0.12 0.13 0.24 0.11 0.25 0.15 0.18 0.36
## 2.5th percentile   0.21 0.49 0.09 0.06 0.09 0.18 0.07 0.20 0.11 0.13 0.27
## 97.5th percentile  0.38 0.85 0.24 0.19 0.19 0.32 0.16 0.31 0.19 0.24 0.48
{cat("summary2 (Brookfield et al. 1996):", "\n")
round(Null.alleles$null.allele.freq$summary2,2)}   
## summary2 (Brookfield et al. 1996):
##                      C2   C3   C6 C117 D131 C26b Cv12  D10  D11 D107 D116
## Observed frequency 0.19 0.14 0.12 0.09 0.12 0.22 0.11 0.22 0.15 0.17 0.21
## Median frequency   0.19 0.14 0.12 0.08 0.12 0.22 0.11 0.22 0.14 0.17 0.21
## 2.5th percentile   0.13 0.09 0.07 0.04 0.08 0.17 0.07 0.17 0.11 0.12 0.16
## 97.5th percentile  0.24 0.20 0.17 0.13 0.17 0.28 0.15 0.27 0.18 0.22 0.26

F. Overall interpretation

Spatial genetic structure:

The Kentucky Arrow Darter data used in this lab includes a great deal of genetic structure.

HWE

Therefore, we expect global estimates (i.e., the whole dataset) of He to be out of HWE due to population substructure (HWE assumes panmictic populations). We would also expect data to be out of HWE when analyzing data by populations due to population substructure. We could, then, test HWE and linkage disequilibrium at the population level.

Linkage:

The smaller samples sizes we have can result in deviations from HWE and in linkage disequilibrium as an artifact of sample size and/or breeding structure (if one male is responsible for all breeding, his genes would appear “linked” in offspring genotypes, even though they are not physically linked in the genome).

So, what do we do? We can look for patterns. Are there loci that are consistently out of HWE across samples sites while other loci are not out of HWE suggesting that there are null alleles or other data quality control issues? With the full data set, this was not the case. Are there loci that are consistently linked across different ponds (while others are not), suggesting that they are linked? With the full dataset, this was not the case.

Null alleles:

Missing data can imply null (non-amplifying) alleles. In this case, while there are loci that have higher “drop-out” rates than others, this is more likely due to some loci being more difficult to call. In addition, some of the samples used in this study were smaller fin clips with lower yields of DNA, resulting in incomplete genotypes. While presence of null alleles is a possibility, genetic structure, breeding patterns at low population size and aggressive quality control of genotypes can all explain the results. Finally, with all of the structure in these data, there are examples at population level of unique alleles that are fixed or nearly fixed.

Assess genetic diversity

These measures are typically quantified per population.

A. Rarefied allelic richness

Both nominal sample size (number of fish sampled) and valid sample size (e.g., for each locus, the number of fish with non-missing genetic data) vary between sites. We would expect the number of alleles found in a population to increase with the number of individuals genotyped.

We can check this by plotting the number of alleles against sample size. Here we create an object ‘Sum’ that contains the summary of the genind object, then we can access its elements by ‘$’ to plot what we need. The function ‘names’ lists the names of the elements, which reduced the guesswork.

Sum <- summary(Kad.genind)
names(Sum)
## [1] "n"         "n.by.pop"  "loc.n.all" "pop.n.all" "NA.perc"   "Hobs"     
## [7] "Hexp"

The site names are quite long, hence we print the labels vertically by setting ‘las=3,’ and we modify the margins (‘mar’). The four numbers give the size of each margin in the following order: bottom, left, top, right.

We add a regression line to the scatterplot with the function ‘abline,’ where we specify the linear regression model with the function ‘lm.’ In this case, we model the response ‘pop.n.all’ as a function of predictor ‘n.by.pop.’

The barchart (left) shows that there is considerable variation among populations in the number of alleles observed across all loci. The scatterplot (right) with the red regression line shows that the number of alleles increases with sample size.

par(mar=c(5.5, 4.5,1,1))
barplot(Sum$pop.n.all, las=3, 
       xlab = "", ylab = "Number of alleles")

plot(Sum$n.by.pop, Sum$pop.n.all, 
       xlab = "Sample size", ylab = "Number of alleles")
abline(lm(Sum$pop.n.all ~ Sum$n.by.pop), col = "red")

Hence we should not compare the number of alleles directly. Instead, we’ll use rarefied allelic richness (Ar).

By default, the function ‘allel.rich’ finds the lowest valid sample size across all populations and loci, and multiplies it by the ploidy level. The number is stored as ‘Richness$alleles.sampled’ (here: 3 individuals * 2 alleles = 6 alleles). Alternatively, this number can be set with the ‘min.alleles’ argument.

Richness <- PopGenReport::allel.rich(Kad.genind, min.alleles = NULL)
Richness$alleles.sampled
## [1] 10

Populations with more alleles are resampled to determine the average allelic richness among the minimum number of alleles. Here, this means that 10 alleles are sampled from each population, allelic richness is calculated, and the process is repeated many times to determine the average).

Let’s plot the results again. The barchart shows that there is considerable variation in genetic diversity among ponds. The scatterplot against sample size (here: for each population, the average number of valid alleles across loci) suggests that the variation is not as realted to sample size as in the first plot above. The regression line (red) is becoming more horizontal and the range on the y axis has decreased by a lot.

Here we plot the average Ar across loci, so that the result does not depend on the number of loci used.

par(mar=c(5.5, 4.5,1,1))
barplot(Richness$mean.richness, las=3, ylab="Rarefied allelic richness (Ar)")

plot(colMeans(Richness$pop.sizes), Richness$mean.richness,
     xlab="Valid sample size", 
     ylab="Rarefied allelic richness (Ar)")
abline(lm(Richness$mean.richness ~ colMeans(Richness$pop.sizes)), col="red")

B. Observed and expected heterozygosity

Note: Writing the ‘genind’ summary into an object ‘Sum’ allows accessing its attributes by name.

  Sum <- summary(Kad.genind)
  names(Sum)
## [1] "n"         "n.by.pop"  "loc.n.all" "pop.n.all" "NA.perc"   "Hobs"     
## [7] "Hexp"

Expected heterozygosity (here: Hexp) is the heterozygosity expected in a population under HWE, and observed heterozygosity (here: Hobs) is the observed number of heterozygotes at a locus divided by the total number of genotyped individuals. Here are the global values (pooled across all populations):

par(mar=c(3, 4.5,1,1))
  barplot(Sum$Hexp, ylim=c(0,1), ylab="Expected heterozygosity")

  barplot(Sum$Hobs, ylim=c(0,1), ylab="Observed heterozygosity")

By locus and population:

Here we use ‘seppop’ to split the genind object by population, then ‘sapply’ to apply function ‘summary’ to each population.

Hobs <- t(sapply(seppop(Kad.genind), function(ls) summary(ls)$Hobs))
  Hexp <- t(sapply(seppop(Kad.genind), function(ls) summary(ls)$Hexp))
  {cat("Expected heterozygosity (Hexp):", "\n")
  round(Hexp, 2)
  cat("\n", "Observed heterozygosity (Hobs):", "\n")
  round(Hobs, 2)}
## Expected heterozygosity (Hexp): 
## 
##  Observed heterozygosity (Hobs):
##      C2   C3   C6 C117 D131 C26b Cv12  D10  D11 D107 D116
## 1  0.54 0.00 0.71 0.62 0.71 0.50 0.92 0.29 0.92 0.78 0.58
## 2  0.00 0.42 0.53 0.58 0.42 0.00 0.47 0.00 0.21 0.74 0.00
## 3  0.46 0.00 0.50 0.69 0.73 0.73 0.73 0.42 0.88 0.77 0.23
## 4  0.46 0.00 0.12 0.00 0.46 0.25 0.67 0.54 0.83 0.04 0.00
## 5  0.00 0.00 0.00 0.00 0.22 0.00 0.00 0.00 0.00 0.00 0.00
## 6  0.00 0.00 0.63 0.30 0.52 0.67 0.89 0.54 0.70 0.77 0.11
## 7  0.36 0.00 0.21 0.57 0.64 0.64 0.71 0.75 0.79 0.64 0.57
## 8  0.00 0.00 0.29 0.19 0.90 0.48 0.71 0.63 0.65 0.52 0.00
## 9  0.82 0.00 0.56 0.83 0.61 0.67 0.67 0.78 0.83 0.88 0.28
## 10 0.40 0.00 0.68 0.44 0.81 0.89 0.69 0.87 0.60 0.80 0.56

Population 5 shows little variation in expected heterozygosity across loci

Let’s plot the average across all loci for each population:

Here we use ‘apply’ to apply the function ‘mean’ to the rows (MARGIN = 1). For columns, use ‘MARGIN = 2.’

par(mar=c(5.5, 4.5, 1, 1))
  Hobs.pop <- apply(Hobs, MARGIN = 1, FUN = mean)
  Hexp.pop <- apply(Hexp, 1, mean) 
  barplot(Hexp.pop, ylim=c(0,1), las=3, ylab="Expected heterozygosity")

  barplot(Hobs.pop, ylim=c(0,1), las=3, ylab="Observed heterozygosity")

C. Create table with sitel-level genetic diversity measures

Kad.diversity <- data.frame(Pop = names(Hobs.pop),
                              n = Sum$n.by.pop,
                              Hobs = Hobs.pop,
                              Hexp = Hexp.pop,
                              Ar = Richness$mean.richness)
Kad.diversity
##    Pop  n       Hobs       Hexp       Ar
## 1    1 24 0.59766140 0.58324965 3.421913
## 2    2 19 0.30622010 0.31453035 1.930518
## 3    3 26 0.55944056 0.56898870 3.259311
## 4    4 24 0.30681818 0.27541035 1.871246
## 5    5  9 0.02020202 0.03647587 1.155860
## 6    6 27 0.46568247 0.46162692 2.727438
## 7    7 14 0.53512397 0.54164116 2.877023
## 8    8 31 0.39762952 0.43281840 2.776163
## 9    9 18 0.62982769 0.60697080 3.713312
## 10  10 21 0.61251176 0.64287164 3.750409

You can save the R object ‘Kad.diversity’ with the code below

##if(!dir.exists(paste0(here(),"/output"))) dir.create(paste0(here(),"/output"))
#save(Kad.diversity, file = paste0(here(),"/output/Kad.diversity.RData"))
#load(paste0(here(),"/output/Kad.diversity.RData"))

Aggregate genetic data at population level (allele frequencies)

For some analyses, we will need to aggregate data from the individual to the population level, e.g. as a table of allele frequencies per population.

Here we convert the ‘genind’ object to a ‘genpop’ object (NOT the same as a ‘genepop’ object!). This is defined in the package ‘adegenet’ to hold population-level genetic data. The function ‘genind2genpop’ obviously converts from ‘genind’ to ‘genpop.’

Kad.genpop <- adegenet::genind2genpop(Kad.genind)
## 
##  Converting data from a genind to a genpop object... 
## 
## ...done.

The function ‘makefreq’ extracts the table with allele frequencies from the ‘genpop’ object. We’ll plot just a few lines and alleles.

Freq <- adegenet::makefreq(Kad.genpop)
## 
##  Finding allelic frequencies from a genpop object... 
## 
## ...done.
round(Freq[1:6,1:10], 2)
##   C2.141 C2.145 C2.133 C2.129 C2.148 C3.214 C3.210 C3.212 C6.272 C6.280
## 1   0.38   0.56   0.06      0      0   1.00   0.00      0   0.40   0.42
## 2   0.00   1.00   0.00      0      0   0.53   0.47      0   0.68   0.00
## 3   0.67   0.29   0.04      0      0   1.00   0.00      0   0.42   0.00
## 4   0.31   0.00   0.69      0      0   1.00   0.00      0   0.06   0.00
## 5   1.00   0.00   0.00      0      0   0.89   0.11      0   0.00   0.00
## 6   1.00   0.00   0.00      0      0   0.89   0.11      0   0.19   0.00

The allele frequencies of all alleles from the same locus (e.g., C2.141) should sum to 1 for each population. With 11 loci, the row sums should thus add to 11.

apply(Freq, MARGIN = 1, FUN = sum)    # Just checking
##  1  2  3  4  5  6  7  8  9 10 
## 11 11 11 11 11 11 11 11 11 11